9 research outputs found

    Shortest Unique Substring Query Revisited

    Get PDF
    We revisit the problem of finding shortest unique substring (SUS) proposed recently by [6]. We propose an optimal O(n)O(n) time and space algorithm that can find an SUS for every location of a string of size nn. Our algorithm significantly improves the O(n2)O(n^2) time complexity needed by [6]. We also support finding all the SUSes covering every location, whereas the solution in [6] can find only one SUS for every location. Further, our solution is simpler and easier to implement and can also be more space efficient in practice, since we only use the inverse suffix array and longest common prefix array of the string, while the algorithm in [6] uses the suffix tree of the string and other auxiliary data structures. Our theoretical results are validated by an empirical study that shows our algorithm is much faster and more space-saving than the one in [6]

    Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish /

    Get PDF
    The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on the introduced representations performs the statistical morphological disambiguation of Turkish with a recall of as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem occurs in disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this problem of pronunciation disambiguation and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase level accentuation based on content word/function word distinction. This approach seems easy and adequate for some right headed languages such as English but is not suitable for languages such as Turkish. We then use a a heuristic approach to mark up the phrase boundaries based on dependency parsing on a basis of phrase level accentuation for Turkish TTS synthesizers

    Ψ-RA: a parallel sparse index for genomic read alignment

    Get PDF
    Background Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology. Results We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named Ψ-RA (PSI-RA: parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption of the index based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of approximate matching capability was missing. We provide this by defining the rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors are more probable. Ψ-RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short read aligners, and show that indexing a genome with SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions. Conclusions Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. Ψ-RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that Ψ-RA is built on naturally incorporates the modern multicore architecture and thus further speed-up can be gained. All the information, including the source code of Ψ-RA, can be downloaded at: http://www.busillis.com/o_kulekci/PSIRA.zip webcite

    Range Selection And Predecessor Queries In Data Aware Space And Time

    No full text
    On a given vector X=〈x1,x2,…,xn〉 of integers, the range selection (i,j,k) query is finding the k-th smallest integer in 〈xi,xi+1,…,xj〉 for any (i,j,k) such that 1≤i≤j≤n, and 1≤k≤j−i+1. Previous studies on the problem proposed data structures that occupied additional O(n⋅log⁡n) bits of space over the X itself that answer the queries in logarithmic time. In this study, we replace X and encode all integers in it via a single wavelet tree by using S=n⋅log⁡u+∑log⁡xi+o(n⋅log⁡u+∑log⁡xi) bits, where u is the number of distinct ⌊log⁡xi⌋ values observed in X. Notice that u is at most 32 (64) for 32-bit (64-bit) integers and when xi\u3eu, the space used for xi in the proposed data structure is less than the Elias-δ coding of xi. Besides data-aware coding of X, the range selection is performed in O(log⁡u+log⁡x′) time where x′ is the k-th smallest integer in the queried range. This result is adaptive and achieves the range selection regardless of the size of X, and the time-complexity depends on the answer itself. Additionally, we can answer range predecessor queries (i,j,x′): return the largest element y≤x′ in 〈xi,xi+1,…,xj〉, in O(log⁡u+log⁡x′) time. In summary, to the best of our knowledge, we present the first algorithms using data-aware space and time for the general range selection and predecessor problems
    corecore